Nonnegative/binary matrix factorization with a D-Wave quantum annealer
D-Wave quantum annealers represent a novel computational architecture and
have attracted significant interest, but have been used for few real-world
computations. Machine learning has been identified as an area where quantum
annealing may be useful. Here, we show that the D-Wave 2X can be effectively
used as part of an unsupervised machine learning method. This method can be
used to analyze large datasets. The D-Wave only limits the number of features
that can be extracted from the dataset. We apply this method to learn the
features from a set of facial images.
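As a rough illustration of the kind of factorization the title names, the sketch below alternates between a binary factor and a nonnegative factor. It assumes the model has the form X ≈ WH with W nonnegative and H binary, which the abstract does not spell out. The function name `nbmf` and its parameters are hypothetical, the binary step uses exhaustive enumeration as a classical stand-in for the quantum annealer, and the nonnegative step is a clipped least-squares update rather than an exact solver.

```python
import itertools
import numpy as np

def nbmf(X, k, n_iter=20, seed=0):
    """Sketch: factor X (n x m) as W @ H with W >= 0 and H in {0, 1}.

    Hypothetical helper, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))  # random nonnegative start
    # Every possible binary column of length k, stored as columns: (k, 2**k).
    candidates = np.array(list(itertools.product([0.0, 1.0], repeat=k))).T
    for _ in range(n_iter):
        # Binary step: for each column of X pick the binary vector with the
        # smallest squared residual (brute force in place of an annealer).
        recon = W @ candidates                                   # (n, 2**k)
        err = ((X[:, :, None] - recon[:, None, :]) ** 2).sum(axis=0)
        H = candidates[:, err.argmin(axis=1)]                    # (k, m)
        # Nonnegative step: least squares, then clip negatives to zero.
        # This is a crude projection, not an exact nonnegative solve.
        W = np.clip(X @ np.linalg.pinv(H), 0.0, None)
    return W, H
```

Because the enumeration grows as 2^k, this sketch is only practical for a handful of features, which loosely mirrors the abstract's remark that the hardware limits the number of features rather than the dataset size.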
MalwareDNA: Simultaneous Classification of Malware, Malware Families, and Novel Malware
Malware is one of the most dangerous and costly cyber threats to national
security and a crucial factor in modern cyber-space. However, the adoption of
machine learning (ML) based solutions against malware threats has been
relatively slow. Shortcomings in the existing ML approaches are likely
contributing to this problem. The majority of current ML approaches ignore
real-world challenges such as the detection of novel malware. In addition,
proposed ML approaches are often designed either for malware/benign-ware
classification or malware family classification. Here we introduce and showcase
preliminary capabilities of a new method that can perform precise
identification of novel malware families, while also unifying the capability
for malware/benign-ware classification and malware family classification into a
single framework. Comment: Accepted at IEEE ISI 202
Interactive Distillation of Large Single-Topic Corpora of Scientific Papers
Highly specific datasets of scientific literature are important for both
research and education. However, it is difficult to build such datasets at
scale. A common approach is to build these datasets reductively by applying
topic modeling on an established corpus and selecting specific topics. A more
robust but time-consuming approach is to build the dataset constructively, with
a subject matter expert (SME) handpicking documents. This method does not
scale and is prone to error as the dataset grows. Here we showcase a new tool,
based on machine learning, for constructively generating targeted datasets of
scientific literature. Given a small initial "core" corpus of papers, we build
a citation network of documents. At each step of the citation network, we
generate text embeddings and visualize the embeddings through dimensionality
reduction. Papers are kept in the dataset if they are "similar" to the core or
are otherwise pruned through human-in-the-loop selection. Additional insight
into the papers is gained through sub-topic modeling using SeNMFk. We
demonstrate our new tool for literature review by applying it to two different
fields in machine learning. Comment: Accepted at 2023 IEEE ICMLA conference
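The pruning step described in this abstract could be sketched as follows. This is a minimal sketch under stated assumptions: "similar" is taken to mean cosine similarity to the centroid of the core embeddings, and the function name `prune_by_similarity` and the `threshold` value are hypothetical. The abstract itself leaves the similarity criterion open and pairs it with human-in-the-loop selection, which is not modeled here.

```python
import numpy as np

def prune_by_similarity(core_emb, cand_emb, threshold=0.5):
    """Return indices of candidates close to the core corpus, plus all scores.

    core_emb: (n_core, d) embeddings of the initial core papers.
    cand_emb: (n_cand, d) embeddings of papers reached via citations.
    threshold: hypothetical cosine-similarity cutoff (an assumption).
    """
    core_emb = np.asarray(core_emb, dtype=float)
    cand_emb = np.asarray(cand_emb, dtype=float)
    # Unit-length centroid of the core corpus.
    centroid = core_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    # Cosine similarity of each candidate to that centroid.
    normed = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sims = normed @ centroid
    keep = np.flatnonzero(sims >= threshold)
    return keep, sims
```

In an interactive setting, the returned scores would more plausibly be shown to the expert alongside the low-dimensional visualization than applied as a hard cutoff.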
Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
Identification of the family to which a malware specimen belongs is essential
in understanding the behavior of the malware and developing mitigation
strategies. Solutions proposed by prior work, however, are often not
practicable due to the lack of realistic evaluation factors. These factors
include learning under class imbalance, the ability to identify new malware,
and the cost of production-quality labeled data. In practice, deployed models
face prominent, rare, and new malware families. At the same time, obtaining a
large quantity of up-to-date labeled malware for training a model can be
expensive. In this paper, we address these problems and propose a novel
hierarchical semi-supervised algorithm, which we call the HNMFk Classifier,
that can be used in the early stages of the malware family labeling process.
Our method is based on non-negative matrix factorization with automatic model
selection, that is, with an estimation of the number of clusters. With HNMFk
Classifier, we exploit the hierarchical structure of the malware data together
with a semi-supervised setup, which enables us to classify malware families
under conditions of extreme class imbalance. Our solution can abstain from a
prediction (the rejection option), which yields promising results in
the identification of novel malware families and helps with maintaining the
performance of the model when a low quantity of labeled data is used. We
perform bulk classification of nearly 2,900 malware families, both rare and
prominent, through static analysis, using nearly 388,000 samples from the
EMBER-2018 corpus. In our experiments, we surpass both supervised and
semi-supervised baseline models with an F1 score of 0.80. Comment: Accepted at ACM TOP
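The rejection option mentioned in this abstract can be illustrated generically. The sketch below assumes each sample comes with per-class confidence scores and abstains when the best score falls below a cutoff. The name `predict_with_rejection`, the `ABSTAIN` marker, and the `min_confidence` value are hypothetical, and the abstract does not say how the HNMFk Classifier decides to abstain.

```python
import numpy as np

ABSTAIN = -1  # hypothetical marker for "no prediction"

def predict_with_rejection(scores, min_confidence=0.6):
    """Pick the top-scoring class per row, or ABSTAIN if its score is too low.

    scores: (n_samples, n_classes) confidence scores (assumed input format).
    """
    scores = np.asarray(scores, dtype=float)
    labels = scores.argmax(axis=1)
    confident = scores.max(axis=1) >= min_confidence
    return np.where(confident, labels, ABSTAIN)
```

Samples that receive the abstain marker are natural candidates for manual review, for example as possible members of a family not seen during training.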